-
Notifications
You must be signed in to change notification settings - Fork 49
Fix scalar matching in grapheme semantic mode #565
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Previously we would allow consuming a single scalar in grapheme semantic mode for a DSL `.scalar`. Change this behavior such that it is treated the same as a character with a single scalar. That is: - In grapheme mode it must match an entire grapheme. - In scalar semantic mode, it may match a single scalar.
Convert AST scalars to DSL scalars such that we can preserve the `\u{...}` syntax where the user chooses to use it. This requires fixing up some escaping logic such that we don't escape `\u{...}` sequences.
@swift-ci please test |
let r3 = Regex { | ||
"👨" as UnicodeScalar | ||
"\u{200D}" as UnicodeScalar | ||
"👨" as UnicodeScalar | ||
"\u{200D}" as UnicodeScalar | ||
"👧" as UnicodeScalar | ||
"\u{200D}" as UnicodeScalar | ||
"👦" as UnicodeScalar | ||
} | ||
XCTAssertNil(try r3.firstMatch(in: "👨👨👧👦")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, actually I'm not sure whether this is correct, as we allow matching for e.g:
print("👨👨👧👦".firstMatch(of: /👨\u{200D}👨\u{200D}👧\u{200D}👦/)
should we be combining the scalars in a DSL concatenation the same way we combine them in a literal? And if so, would it also apply to regex scalar children e.g /\u{...}/
?
Or is a DSL concatenation semantically more like the following?
print("👨👨👧👦".firstMatch(of: /(?:👨)(?:\u{200D})(?:👨)(?:\u{200D})(?:👧)(?:\u{200D})(?:👦)/)) // nil
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The scalars should get globbed together into a stream of scalars, over which grapheme breaking can occur. This is similar to scalar escaped inside string literals. For the DSL, I'd treat them like a stream of scalars ala /\u{...}\u{...}\u{...}
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah makes sense, though we'll need to define exactly when the combining happens, e.g do we combine the following?
Regex {
"a" as UnicodeScalar
/\u{301}/
}
and would that still happen if there's further contents in the regex? (because this is now a concat within a concat) e.g:
Regex {
"a" as UnicodeScalar
/\u{301}b/
}
Should Regex {...}
calls serve as a boundary between combining scalars? e.g:
Regex {
"a" as UnicodeScalar
Regex {
"\u{301}" as UnicodeScalar
"b"
}
}
I'm still not convinced that we should have this scalar api while in grapheme semantics at all, converting it to a matching a character under the hood seems unintuitive from a user's perspective. What does a scalar RegexComponent even mean then? If it's given a single scalar character it'll match it as a character and if it's given something that isn't a character (like Would it be possible for us to instead throw an error if we have a |
I don't disagree with this, though I should note that it's equivalent to using a
Yeah, I'm starting to think we should maybe consider retracting these scalar APIs on RegexBuilder until we work out these details (there's only 2 of them AFAIK, the
Yeah, that would be a possible option, though it would surface as a runtime crash on the first call to a matching function. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, but I think we can clean up the code gen path a little.
// A scalar always matches the same as a single scalar character. This | ||
// means it must match a whole grapheme in grapheme semantic mode, but | ||
// can match a single scalar in scalar semantic mode. | ||
try emitCharacter(Character(s)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this be conditional on whether we're in grapheme semantics or scalar semantics?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
emitCharacter
will call emitConsumeScalar
for scalar mode
// A scalar always matches the same as a single scalar character. This | ||
// means it must match a whole grapheme in grapheme semantic mode, but | ||
// can match a single scalar in scalar semantic mode. | ||
return try Character(s).generateConsumer(opts) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems pretty convoluted. Why do we not just emit a scalar consuming instruction in scalar semantic mode and set the grapheme boundary bit? CC @rctcwyvrn to see if we need to get some of her instructions in first.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we do that within a consumer matching function? I know we eventually want to migrate off consumer functions, but for now this will generate a consumeScalar
consumer in scalar semantic mode
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This path should only be reached for non-ascii custom character classes so I don't think we can emit an instruction instead of a consume fn in that case. Maybe we could if we made custom character classes emit in the same way we do alternations?
let r3 = Regex { | ||
"👨" as UnicodeScalar | ||
"\u{200D}" as UnicodeScalar | ||
"👨" as UnicodeScalar | ||
"\u{200D}" as UnicodeScalar | ||
"👧" as UnicodeScalar | ||
"\u{200D}" as UnicodeScalar | ||
"👦" as UnicodeScalar | ||
} | ||
XCTAssertNil(try r3.firstMatch(in: "👨👨👧👦")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The scalars should get globbed together into a stream of scalars, over which grapheme breaking can occur. This is similar to scalar escaped inside string literals. For the DSL, I'd treat them like a stream of scalars ala /\u{...}\u{...}\u{...}
.
Edge case I ran into while updating #525 Should a single scalar composed value match the decomposed version while in grapheme mode? Ie In the scalar optimizations branch I tried to make it match this PR's scalar matching by emitting I'm leaning towards that it shouldn't be able to match the decomposed version, as we asked it to match a scalar and it would end up matching a completely different scalar sequence
|
I think I'm leaning towards saying it actually should match the decomposed version. IMO the I do agree that it could potentially be unexpected tho. |
@@ -216,7 +216,7 @@ extension AST.Atom { | |||
|
|||
switch self.kind { | |||
case let .char(c): return .char(c) | |||
case let .scalar(s): return .char(Character(s.value)) | |||
case let .scalar(s): return .scalar(s.value) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want the behaviour of regex to be different depending on if we input a unicode value as its scalar code or as a literal character? ie "ợ"
vs "\u{1ee3}"
? With this one would be interpreted as a .char
and the other as a .scalar
in the DSLTree. I think the intent for the original code here was to make those two inputs equivalent
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think they should be semantically equivalent (i.e they go through the exact same code paths in byte code emission), but this distinction allows us to have more accurate printing of the DSL tree when performing a literal -> DSL conversion
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(if we didn't need the distinction for printing, IMO it would make sense to outright remove .scalar
from the DSL tree)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If they behave the same then why would we need to print them differently? Printing them differently would imply that they behave differently I think.
Also yea, seeing as we're moving towards making a .scalar
behave exactly the same as .char
, maybe it would make sense to remove it outright
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If they behave the same then why would we need to print them differently? Printing them differently would imply that they behave differently I think.
For example, if the user wrote /e\u{301}/
, ideally the transformed DSL tree would be:
Regex {
"e\u{301}"
}
instead of:
Regex {
"é"
}
Semantically they're the same, but IMO it's much nicer to preserve the original syntax the user wrote.
Yes, in grapheme cluster semantics we do Unicode Canonical Equivalence. |
Ok so #525, is ready to merge and up to date with this branch. Should we go ahead and merge that one? @hamishknight were you planning on doing additional work on combining scalars together? |
@rctcwyvrn I'm happy with landing both, I can work on the scalar combination logic in a follow-up |
Let's merge this whenever you're ready @rctcwyvrn, and we can coordinate how to land @hamishknight's work on top and cherry-pick things together |
@swift-ci please test |
- Adds new instructions for matching characters and scalars case insensitively - Compiles ascii character matches into the faster scalar match instructions even in grapheme semantic mode - Optimizes out unnecessary runtime grapheme boundary checks for all ascii strings - Also includes fixes to scalar matching in grapheme semantic mode (#565)
Changes merged in 7752047 |
) - Adds new instructions for matching characters and scalars case insensitively - Compiles ascii character matches into the faster scalar match instructions even in grapheme semantic mode - Optimizes out unnecessary runtime grapheme boundary checks for all ascii strings - Also includes fixes to scalar matching in grapheme semantic mode (swiftlang#565)
) - Adds new instructions for matching characters and scalars case insensitively - Compiles ascii character matches into the faster scalar match instructions even in grapheme semantic mode - Optimizes out unnecessary runtime grapheme boundary checks for all ascii strings - Also includes fixes to scalar matching in grapheme semantic mode (swiftlang#565)
) - Adds new instructions for matching characters and scalars case insensitively - Compiles ascii character matches into the faster scalar match instructions even in grapheme semantic mode - Optimizes out unnecessary runtime grapheme boundary checks for all ascii strings - Also includes fixes to scalar matching in grapheme semantic mode (swiftlang#565)
) - Adds new instructions for matching characters and scalars case insensitively - Compiles ascii character matches into the faster scalar match instructions even in grapheme semantic mode - Optimizes out unnecessary runtime grapheme boundary checks for all ascii strings - Also includes fixes to scalar matching in grapheme semantic mode (swiftlang#565)
) - Adds new instructions for matching characters and scalars case insensitively - Compiles ascii character matches into the faster scalar match instructions even in grapheme semantic mode - Optimizes out unnecessary runtime grapheme boundary checks for all ascii strings - Also includes fixes to scalar matching in grapheme semantic mode (swiftlang#565)
Previously we would allow consuming a single scalar of a grapheme in grapheme semantic mode when matching a DSL
.scalar
. Change this behavior such that it is treated the same as a character with a single scalar. That is:Additionally, start converting AST
.scalar
atoms into.scalar
DSL atoms. This is both for consistency with the DSL, and allows us to preserve the\u{...}
syntax in more cases when a DSL conversion is performed.Resolves #563
Resolves #564
rdar://96662651